Abstract

Word embeddings - distributed word representations that can be learned
from unlabelled data - have been shown to have high utility in many
natural language processing applications. In this paper, we perform an
extrinsic evaluation of five popular word embedding methods in the
context of four sequence labelling tasks: POS-tagging, syntactic
chunking, NER and MWE identification. A particular focus of the paper is
analysing the effects of task-based updating of word representations. We
show that when using word embeddings as features, as few as several
hundred training instances are sufficient to achieve competitive
results, and that word embeddings lead to improvements over OOV words
and out of domain. Perhaps more surprisingly, our results indicate there
is little difference between the different word embedding methods, and
that simple Brown clusters are often competitive with word embeddings
across all tasks we consider.

1 Introduction 

Recently, distributed word representations have grown to become a
mainstay of natural language processing (NLP), and have been shown to
have empirical utility in a myriad of tasks (Collobert and Weston, 2008;
Turian et al., 2010; Baroni et al., 2014; Andreas and Klein, 2014). The
underlying idea behind distributed word representations is simple: to
map each word $w$ in our vocabulary $V$ onto a continuous-valued vector
of dimensionality $d \ll |V|$. Words that are similar (e.g., with
respect to syntax or lexical semantics) will ideally be mapped to
similar regions of the vector space, implicitly supporting
generalisation across in-vocabulary (IV) items and countering the
effects of data sparsity for low-frequency and out-of-vocabulary (OOV)
items.

Without some means of deriving the vector representations automatically,
with no reliance on labelled data, word embeddings would have little
practical utility. Fortunately, it has been shown that they can be
“pre-trained” from unlabelled text data using various algorithms to
model the distributional hypothesis (i.e., that words which occur in
similar contexts tend to be semantically similar). Pre-training methods
have been refined considerably in recent years, and scaled up to
increasingly large corpora.

As with other machine learning methods, it is well known that the
quality of the pre-trained word embeddings depends heavily on factors
including parameter optimisation, the size of the training data, and the
fit with the target application. For example, Turian et al. (2010)
showed that the optimal dimensionality for word embeddings is
task-specific. One factor which has received relatively little attention
in NLP is the effect of “updating” the pre-trained word embeddings as
part of the task-specific training, based on self-taught learning (Raina
et al., 2007). Updating leads to word representations that are
task-specific, but often at the cost of over-fitting low-frequency and
OOV words.

In this paper, we perform an extensive evaluation of four recently
proposed word embedding approaches under fixed experimental conditions,
applied to four sequence labelling tasks: POS-tagging, full-text
chunking, named entity recognition (NER), and multiword expression (MWE)
identification. Compared to previous empirical studies (Collobert et
al., 2011; Turian et al., 2010; Pennington et al., 2014), we fill gaps
by considering more word embedding approaches and evaluating them over
more sequence labelling tasks. In addition, we explore the following
research questions:

RQ1: are these word embeddings better than baseline approaches of
     one-hot unigram features and Brown clusters?

RQ2: do word embeddings require less training data (i.e. generalise
     better) than one-hot unigram features? If so, to what degree can
     word embeddings reduce the amount of labelled data?

RQ3: what is the impact of updating word embeddings in sequence
     labelling tasks, both empirically over the target task and
     geometrically over the vectors?

RQ4: what is the impact of these word embeddings (with and without
     updating) on both OOV items (relative to the training data) and
     out-of-domain data?

RQ5: overall, are some word embeddings better than others in a sequence
     labelling context?

2 Word Representations 

2.1 Types of Word Representations

Turian et al. (2010) identify three varieties of word representations:
distributional, cluster-based, and distributed.

Distributional representation methods map each word $w$ to a context
word vector $C_w$, which is constructed directly from co-occurrence
counts between $w$ and its context words. The learning methods either
store the co-occurrence counts between two words $w$ and $i$ directly in
$C_{wi}$ (Sahlgren, 2006; Turney et al., 2010; Honkela, 1997) or project
the co-occurrence counts between words into a lower dimensional space
(Řehůřek and Sojka, 2010; Lund and Burgess, 1996), using dimensionality
reduction techniques such as SVD (Dumais et al., 1988) and LDA (Blei et
al., 2003).

Cluster-based representation methods build clusters of words by applying
either soft or hard clustering algorithms (Lin and Wu, 2009; Li and
McCallum, 2005). Some of them also rely on a co-occurrence matrix of
words (Pereira et al., 1993). The Brown clustering algorithm (Brown et
al., 1992) is the best-known method in this category.

Distributed representation methods usually map words into dense,
low-dimensional, continuous-valued vectors, with $x \in \mathbb{R}^d$,
where $d$ is referred to as the word dimension.


2.2 Selected Word Representations 

Over a range of sequence labelling tasks, we evaluate five methods for
inducing word representations: Brown clustering (Brown et al., 1992)
(“Brown”), the neural language model of Collobert & Weston (“CW”)
(Collobert et al., 2011), the continuous bag-of-words model (“CBOW”)
(Mikolov et al., 2013a), the continuous skip-gram model (“Skip-gram”)
(Mikolov et al., 2013b), and Global vectors (“Glove”) (Pennington et
al., 2014). With the exception of CW, all have been shown to be at or
near state-of-the-art in recent empirical studies (Turian et al., 2010;
Pennington et al., 2014). CW is included because it was highly
influential in earlier research, and the pre-trained embeddings are
still used to some degree in NLP. The training of these word
representations is unsupervised: the common underlying idea is to
predict the occurrence of words in the neighbouring context. Their
training objectives share the same form, which is a sum of local
training factors $J(w, \mathrm{ctx}(w))$:

$$L = \sum_{w \in V} J(w, \mathrm{ctx}(w))$$

where $V$ is the vocabulary of a given corpus, and $\mathrm{ctx}(w)$
denotes the local context of word $w$. The local context of a word can
either be its previous $k$ words, or the $k$ words surrounding it. Local
training factors are designed to capture the relationship between $w$
and its local contexts of use, either by predicting $w$ based on its
local context, or by using $w$ to predict the context words. Other than
Brown, which utilises a cluster-based representation, all the methods
employ a distributed representation.

The starting point for CBOW and Skip-gram is to employ softmax to
predict word occurrence:

$$J(w, \mathrm{ctx}(w)) = -\log \frac{\exp(v_w^\top v_{\mathrm{ctx}(w)})}{\sum_{j \in V} \exp(v_j^\top v_{\mathrm{ctx}(w)})}$$

where $v_{\mathrm{ctx}(w)}$ denotes the distributed representation of
the local context of word $w$. CBOW derives $v_{\mathrm{ctx}(w)}$ by
averaging over the context words. That is, it estimates the probability
of each $w$ given its local context. In contrast, Skip-gram applies
softmax to each context word of a given occurrence of word $w$. In this
case, $v_{\mathrm{ctx}(w)}$ corresponds to the representation of one of
its context words. This model can be characterised as predicting context
words based on $w$. In practice, softmax is too expensive to compute
over large corpora, and thus Mikolov et al. (2013b) use hierarchical
softmax and negative sampling to scale up the training.
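The softmax factor above can be illustrated with a minimal pure-Python sketch. It is not the word2vec implementation: it assumes a single embedding table `vecs` (real implementations typically keep separate input and output vectors), computes the full softmax rather than hierarchical softmax or negative sampling, and all function names and toy vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def local_factor(word, v_ctx, vecs):
    """J(w, ctx(w)): negative log softmax of v_w . v_ctx over the vocabulary."""
    scores = {j: dot(v_j, v_ctx) for j, v_j in vecs.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return log_z - scores[word]

def cbow_factor(word, context, vecs):
    """CBOW: v_ctx is the average of the context word vectors."""
    dim = len(vecs[word])
    v_ctx = [sum(vecs[c][k] for c in context) / len(context) for k in range(dim)]
    return local_factor(word, v_ctx, vecs)

def skipgram_factor(word, context, vecs):
    """Skip-gram: one softmax term per context word of this occurrence."""
    return sum(local_factor(word, vecs[c], vecs) for c in context)

# Toy vocabulary with made-up 2-d vectors (illustration only).
vecs = {"the": [0.1, 0.3], "cat": [0.4, -0.2], "sat": [-0.3, 0.5], "mat": [0.2, 0.1]}
print(cbow_factor("cat", ["the", "sat"], vecs))
print(skipgram_factor("cat", ["the", "sat"], vecs))
```

The only difference between the two variants in this sketch is how the context vector is formed: one averaged vector for CBOW, one term per context word for Skip-gram.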

CW considers the local context of a word $w$ to be $m$ words to the left
and $m$ words to the right of $w$. The concatenation of the embeddings
of $w$ and all its context words is taken as input to a neural network
with one hidden layer, which produces a higher level representation
$f(w) \in \mathbb{R}$. Then the learning procedure replaces the
embedding of $w$ with that of a randomly sampled word $w'$ and generates
a second representation $f(w') \in \mathbb{R}$ with the same neural
network. The training objective is to maximise the difference between
them:

$$J(w, \mathrm{ctx}(w)) = \max(0, 1 - f(w) + f(w'))$$

This approach can be regarded as negative sampling with only one
negative example.
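As a rough illustration of the ranking factor above, the following sketch corrupts the centre word of a window and computes the hinge term from two scores. The scoring network itself is omitted, so `score_true` and `score_corrupt` stand in for $f(w)$ and $f(w')$; the function names are hypothetical.

```python
import random

def corrupt_window(window, vocab, rng):
    """Replace the centre word of an odd-length window with a random word w'."""
    centre = len(window) // 2
    corrupted = list(window)
    # Note: the sampled word may coincide with the original centre word.
    corrupted[centre] = rng.choice(vocab)
    return corrupted

def cw_local_factor(score_true, score_corrupt):
    """J = max(0, 1 - f(w) + f(w')): zero once the true window wins by a margin of 1."""
    return max(0.0, 1.0 - score_true + score_corrupt)

rng = random.Random(0)
window = ["the", "cat", "sat", "on", "mat"]
print(corrupt_window(window, ["dog", "ran", "blue"], rng))
print(cw_local_factor(score_true=0.2, score_corrupt=0.5))
```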

Glove assumes the dot product of two word embeddings should be similar
to the logarithm of the co-occurrence count $X_{ij}$ of the two words.
As such, the local factor $J(w, \mathrm{ctx}(w))$ becomes:

$$g(X_{ij}) \left( v_i^\top v_j + b_i + b_j - \log(X_{ij}) \right)^2$$

where $b_i$ and $b_j$ are the bias terms of words $i$ and $j$,
respectively, and $g(X_{ij})$ is a weighting function based on the
co-occurrence count. This weighting function controls the degree of
agreement between the parametric function $v_i^\top v_j + b_i + b_j$ and
$\log(X_{ij})$. Frequently co-occurring word pairs are given larger
weight than infrequent pairs, up to a threshold.
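A small sketch of this weighted least-squares factor follows. The text above does not give the form of the weighting function; the sketch assumes the one proposed by Pennington et al. (2014), with `x_max=100` and `alpha=0.75`, and the function names are illustrative.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Assumed weighting g(X_ij): grows with the count, capped at 1 from x_max upwards."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_local_factor(v_i, v_j, b_i, b_j, x_ij):
    """g(X_ij) * (v_i . v_j + b_i + b_j - log X_ij)^2 for one co-occurring pair."""
    prediction = sum(a * b for a, b in zip(v_i, v_j)) + b_i + b_j
    return glove_weight(x_ij) * (prediction - math.log(x_ij)) ** 2

# Made-up vectors, biases and count for a single word pair.
print(glove_local_factor([0.2, -0.1], [0.4, 0.3], 0.1, -0.2, x_ij=12.0))
```

The cap in `glove_weight` corresponds to the threshold mentioned above: beyond it, more frequent pairs receive no additional weight.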

Brown partitions words into a finite set of word classes. The
conditional probability of seeing the next word is defined to be:

$$p(w_k \mid w_{k-m}^{k-1}) = p(w_k \mid h_k)\, p(h_k \mid h_{k-m}^{k-1})$$

where $h_k$ denotes the word class of the word $w_k$, $w_{k-m}^{k-1}$
are the previous $m$ words, and $h_{k-m}^{k-1}$ are their respective
word classes. Then
$J(w, \mathrm{ctx}(w)) = -\log p(w_k \mid w_{k-m}^{k-1})$. Since there
is no tractable method to find an optimal partition of word classes, the
method uses only a bigram class model, and utilises hierarchical
clustering as an approximation method to find a sufficiently good
partition of words.
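To make the bigram class model concrete, the sketch below scores a fixed word-to-class assignment by the negative log-likelihood of a token sequence, using maximum-likelihood estimates from counts. It only evaluates a given partition; the hierarchical clustering search is not shown, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def class_bigram_nll(tokens, word2class):
    """Sum over k >= 1 of -log[ p(w_k | h_k) * p(h_k | h_{k-1}) ]."""
    classes = [word2class[w] for w in tokens]
    word_count = Counter(tokens)
    class_count = Counter(classes)
    prev_count = Counter(classes[:-1])            # classes that have a successor
    bigram_count = Counter(zip(classes, classes[1:]))
    nll = 0.0
    for k in range(1, len(tokens)):
        p_word = word_count[tokens[k]] / class_count[classes[k]]
        p_class = bigram_count[(classes[k - 1], classes[k])] / prev_count[classes[k - 1]]
        nll -= math.log(p_word * p_class)
    return nll

tokens = "the cat sat the dog sat".split()
grouped = {"the": 0, "cat": 1, "dog": 1, "sat": 2}   # nouns share a class
single = {w: 0 for w in tokens}                      # everything in one class
print(class_bigram_nll(tokens, grouped), class_bigram_nll(tokens, single))
```

On this toy sequence the partition that groups the two nouns yields a lower negative log-likelihood than putting every word in one class, which is the kind of improvement the clustering procedure searches for.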


Data set             Size      Words
UMBC                 48.1GB    3G
One Billion          4.1GB     1G
English Wikipedia    49.6GB    3G

Table 1: Corpora used to pre-train the word embeddings

2.3 Building Word Representations 

For a fair comparison, we train Brown, CBOW, Skip-gram, and Glove on a
fixed corpus, composed of freely available corpora, as detailed in
Tab. 1. The joint corpus was preprocessed with the Stanford CoreNLP
sentence splitter and tokeniser. All consecutive digit substrings were
replaced by NUMf, where f is the length of the digit substring (e.g.,
10.20 is replaced by NUM2.NUM2). Due to the computational complexity of
the pre-training, for CW, we simply downloaded the pre-compiled
embeddings from: http://metaoptimize.com/projects/wordreprs.
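The digit normalisation step can be sketched with a single regular expression. This is an illustration of the rule as described, not the preprocessing script actually used; the function name is hypothetical and only ASCII digits are matched.

```python
import re

_DIGIT_RUN = re.compile(r"[0-9]+")

def normalise_digits(text):
    """Replace each run of consecutive digits with NUM<f>, f being the run length."""
    return _DIGIT_RUN.sub(lambda match: "NUM" + str(len(match.group())), text)

print(normalise_digits("10.20"))  # prints NUM2.NUM2
```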

The dimensionality of the word embeddings and the size of the context
window are the key hyperparameters when learning distributed
representations. We use all combinations of the following values to
train word embeddings on the combined corpus:

• Embedding dim. $d \in \{25, 50, 100, 200\}$

• Context window size $m \in \{1, 5, 10\}$

Brown requires only the number of clusters as a hyperparameter. We
perform clustering with $b \in \{250, 500, 1000, 2000, 4000\}$ clusters.
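For a sense of scale, the grid above can be enumerated as follows: 12 settings per embedding method, plus 5 cluster counts for Brown. The dictionary keys are illustrative names, not options of any particular toolkit.

```python
from itertools import product

EMBEDDING_DIMS = (25, 50, 100, 200)
WINDOW_SIZES = (1, 5, 10)
BROWN_CLUSTERS = (250, 500, 1000, 2000, 4000)

# Every (dimension, window) combination for one embedding method.
embedding_grid = [
    {"dim": d, "window": m} for d, m in product(EMBEDDING_DIMS, WINDOW_SIZES)
]
print(len(embedding_grid), len(BROWN_CLUSTERS))  # prints 12 5
```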

3 Sequence Labelling Tasks 

We evaluate the different word representations over four sequence
labelling tasks: POS-tagging (POS-tagging), full-text chunking
(Chunking), NER (NER) and MWE identification (MWE). For each task, we
fed features into a first-order linear-chain graph transformer
(Collobert et al., 2011) made up of two layers: the upper layer is
identical to a linear-chain CRF (Lafferty et al., 2001), and the lower
layer consists of word representation and hand-crafted features. If we
treat word representations as fixed, the graph transformer is a simple
linear-chain CRF. On the other hand, if we treat the word
representations as model parameters, the model is equivalent to a neural
network with word embeddings as the input layer. We trained all models
using AdaGrad (Duchi et al., 2011).

As in Turian et al. (2010), at each word position, we construct word
representation features from the words in a context window of size two
to either side of the target word, based on the pre-trained
representation of each word type. For Brown, the features are the prefix
features extracted from word clusters in the same way as Turian et al.
(2010). As a baseline (and to test RQ1), we include a one-hot
representation (which is equivalent to a linear-chain CRF with only
lexical context features).
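The window-based feature construction for distributed representations can be sketched as below. The handling of sentence boundaries and of words without a pre-trained vector is not specified in the text; this sketch assumes zero vectors in both cases, and the function name is hypothetical.

```python
def window_features(tokens, position, embeddings, dim, window=2):
    """Concatenate the embeddings of the target word and `window` words either side."""
    feats = []
    for offset in range(-window, window + 1):
        i = position + offset
        vector = embeddings.get(tokens[i]) if 0 <= i < len(tokens) else None
        # Assumption: padding positions and unknown words map to a zero vector.
        feats.extend(vector if vector is not None else [0.0] * dim)
    return feats

embeddings = {"the": [1.0, 2.0], "cat": [3.0, 4.0]}
print(window_features(["the", "cat"], 0, embeddings, dim=2))
```

With a window of two, each position yields a vector of length 5d, so the feature dimensionality grows linearly with the embedding dimension.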

Our hand-crafted features for POS-tagging, Chunking and MWE are those
used by Collobert et al. (2011), Turian et al. (2010) and Schneider et
al. (2014b), respectively. For NER, we use the same feature space as
Turian et al. (2010), except for the previous two predictions, because
we want to evaluate all word representations with the same type of
model - a first-order graph transformer.

In training the distributed word representations, we consider two
settings: (1) the word representations are fixed during sequence model
training; and (2) the graph transformer updates the token-level word
representations during training.

As outlined in Tab. 2, for each sequence labelling task, we experiment
over the de facto corpus, based on pre-existing training-dev-test splits
where available:¹

POS-tagging: the Wall Street Journal portion of the Penn Treebank
    (Marcus et al. (1993): “WSJ”) with Penn POS tags

Chunking: the Wall Street Journal portion of the Penn Treebank (“WSJ”),
    converted into IOB-style full-text chunks using the CoNLL conversion
    scripts for training and dev, and the WSJ-derived CoNLL-2000
    full-text chunking test data for testing (Tjong Kim Sang and
    Buchholz, 2000)

NER: the English portion of the CoNLL-2003 English Named Entity
    Recognition data set, for which the source data was taken from
    Reuters newswire articles (Tjong Kim Sang and De Meulder (2003):
    “Reuters”)

MWE: the MWE dataset of Schneider et al. (2014b), over a portion of text
    from the English Web Treebank² (“EWT”)

¹For the MWE dataset, no such split pre-existed, so we constructed our
own.

²https://catalog.ldc.upenn.edu/LDC2012T13


For all tasks other than MWE,³ we additionally have an out-of-domain
test set, in order to evaluate the out-of-domain robustness of the
different word representations, with and without updating. These
datasets are as follows:

POS-tagging: the English Web Treebank with Penn POS tags (“EWT”)

Chunking: the Brown Corpus portion of the Penn Treebank (“Brown”),
    converted into IOB-style full-text chunks using the CoNLL conversion
    scripts

NER: the MUC-7 named entity recognition corpus⁴ (“MUC7”)

For reproducibility, we tuned the hyperparameters with random search
over the development data for each task (Bergstra and Bengio, 2012). In
this, we randomly sampled 50 distinct hyperparameter sets with the same
random seed for the non-updating models (i.e. the models that don't
update the word representation), and sampled 100 distinct hyperparameter
sets for the updating models (i.e. the models that do). For each set of
hyperparameters and task, we train a model over its training set and
choose the best one based on its performance on development data (Turian
et al., 2010). We also tune the word representation hyperparameters -
namely, the word vector size d and context window size m (distributed
representations), and in the case of Brown, the number of clusters.
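A seeded random search of this kind can be sketched as follows. The search space shown is purely hypothetical (the text does not list the tuned hyperparameters or their ranges); the point is only that a fixed seed makes the sampled sets repeatable.

```python
import random
from math import prod

def sample_hyperparameter_sets(space, n_sets, seed):
    """Draw `n_sets` distinct configurations from a discrete space, reproducibly."""
    if n_sets > prod(len(values) for values in space.values()):
        raise ValueError("not enough distinct configurations in the space")
    rng = random.Random(seed)  # fixed seed: the same sets on every run
    seen, sets = set(), []
    while len(sets) < n_sets:
        candidate = tuple((name, rng.choice(values)) for name, values in sorted(space.items()))
        if candidate not in seen:
            seen.add(candidate)
            sets.append(dict(candidate))
    return sets

# Hypothetical search space, for illustration only.
space = {
    "learning_rate": [0.001, 0.003, 0.01, 0.03, 0.1, 0.3],
    "l2_penalty": [0.0, 1e-6, 1e-5, 1e-4, 1e-3],
    "init_scale": [0.01, 0.05, 0.1, 0.5],
}
non_updating = sample_hyperparameter_sets(space, n_sets=50, seed=1234)
print(len(non_updating), non_updating[0])
```

Each sampled set would then be used to train one model, with the best chosen on development data.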

For the updating models, we found that the results over the test data
were always inferior to those that do not update the word
representations, due to the higher number of hyperparameters and small
sample size (i.e. 100). Since the two-layer model of the graph
transformer contains a distinct set of hyperparameters for each layer,
we reuse the best-performing hyperparameter settings from the
non-updating models, and only tune the hyperparameters of AdaGrad for
the word representation layer. This method requires only 32 additional
runs and achieves consistently better results than 100 random draws.

³Unfortunately, there is no second domain which has been hand-tagged
with MWEs using the method of Schneider et al. (2014b) to use as an
out-of-domain test corpus.

⁴https://catalog.ldc.upenn.edu/LDC2001T02

             Training                  Development             In-domain Test           Out-of-domain Test
POS-tagging  WSJ Sec. 0-18             WSJ Sec. 19-21          WSJ Sec. 22-24           EWT
Chunking     WSJ                       WSJ (1K sentences)      WSJ (CoNLL-00 test)      Brown
NER          Reuters (CoNLL-03 train)  Reuters (CoNLL-03 dev)  Reuters (CoNLL-03 test)  MUC7
MWE          EWT (500 docs)            EWT (100 docs)          EWT (123 docs)           -

Table 2: Training, development and test (in- and out-of-domain) data for
each sequence labelling task.

In order to test the impact of the volume of training data on the
different models (RQ2), we split the training set into 10 partitions
based on a base-2 log scale (i.e., the second smallest partition will be
twice the size of the smallest partition), and created 10 successively
larger training sets by merging these partitions from the smallest one
to the largest one, and used each of these to train a model. From these,
we construct learning curves over each task.
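The learning-curve partitioning described above can be sketched as follows. The exact rounding and ordering used in the experiments are not stated, so the boundary computation here is an assumption, and the function names are hypothetical.

```python
def log2_partitions(items, n_parts=10):
    """Split `items` into n_parts consecutive partitions, each twice the previous size."""
    total_units = 2 ** n_parts - 1  # 1 + 2 + 4 + ... + 2^(n_parts - 1)
    bounds = [round(len(items) * (2 ** i - 1) / total_units) for i in range(n_parts + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(n_parts)]

def cumulative_training_sets(items, n_parts=10):
    """Merge partitions from smallest to largest into successively larger sets."""
    merged, training_sets = [], []
    for part in log2_partitions(items, n_parts):
        merged = merged + part
        training_sets.append(merged)
    return training_sets

sentences = list(range(1023))  # stand-in for 1023 training sentences
print([len(s) for s in cumulative_training_sets(sentences)])
```

When the training set size is not a multiple of 1023, the partition sizes are only approximately powers of two because of rounding.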

For ease of comparison with previous results, we evaluate both in- and
out-of-domain using chunk/entity/expression-level F1-measure (“F1”) for
all tasks except POS-tagging, for which we use token-level accuracy
(“Acc”). To test performance over OOV (unknown) tokens - i.e., the words
that do not occur in the training set - we use token-level accuracy for
all tasks (e.g. for Chunking, we evaluate whether the full IOB tag is
correct or not), due to the sparsity of all-OOV chunks/NEs/MWEs.
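Token-level OOV accuracy as defined above can be computed with a few lines. This is a minimal sketch with a hypothetical function name; it treats a test token as OOV if its exact form never occurs in the training tokens, and any case or digit normalisation is left out.

```python
def oov_token_accuracy(train_tokens, test_tokens, gold_tags, pred_tags):
    """Accuracy over test tokens that never occur in the training set."""
    vocab = set(train_tokens)
    oov_pairs = [
        (gold, pred)
        for token, gold, pred in zip(test_tokens, gold_tags, pred_tags)
        if token not in vocab
    ]
    if not oov_pairs:
        return None  # no OOV tokens, so the measure is undefined
    return sum(gold == pred for gold, pred in oov_pairs) / len(oov_pairs)

print(oov_token_accuracy(
    ["the", "cat"],
    ["the", "dog", "ran"],
    ["DT", "NN", "VBD"],
    ["DT", "NN", "NN"],
))  # prints 0.5
```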

4 Experimental Results and Discussion 

We structure our evaluation by stepping through each of our five
research questions (RQ1-5) from the start of the paper. In this, we make
reference to: (1) the best-performing method both in- and out-of-domain
vs. the state-of-the-art (Tab. 3); (2) a heat map for each task
indicating the convergence rate for each word representation, with and
without updating (Fig. 1); (3) OOV accuracy both in-domain and
out-of-domain for each task (Fig. 2); and (4) visualisation of the
impact of updating on word embeddings, based on t-SNE (Fig. 3).

RQ1: Are the selected word embeddings better than one-hot unigram
features and Brown clusters? As shown in Tab. 3, the best-performing
method for every task except in-domain Chunking is a word embedding
method, although the precise method varies greatly. Fig. 1, on the other
hand, tells a more subtle story: the difference between Unigram and the
other word representations is relatively modest, esp. as the amount of
training data increases. Additionally, the difference between Brown and
the word embedding methods is modest across all tasks. So, the overall
answer would appear to be: yes for unigrams when there is little
training data, but not really for Brown.


RQ2: Do word embedding features require less training data? Fig. 1 shows
that for POS-tagging and NER, with only several hundred training
instances, word embedding features achieve superior results to Unigram.
For example, when trained with 561 instances, the POS-tagging model
using Skip-gram+UP embeddings is 5.3% above Unigram; and when trained
with 932 instances, the NER model using Skip-gram is 11.7% above
Unigram. Similar improvements are also found for other types of word
embeddings and Brown, when the training set is small. However, all word
representations perform similarly for Chunking regardless of training
data size. For MWE, Brown performs slightly better than the other
methods when trained with approximately 25% of the training instances.
Therefore, we conjecture that the POS-tagging and NER tasks benefit more
from distributional similarity than Chunking and MWE.

RQ3: Does task-specific updating improve all word embeddings across all
tasks? Based on Fig. 1, updating of word representations can equally
correct poorly-learned word representations, and harm pre-trained
representations, due to overfitting. For example, Glove performs
significantly worse than Skip-gram in both POS-tagging and NER without
updating, but with updating, the gap between its results and those of
the best performing method becomes smaller. In contrast, Skip-gram
performs worse over the test data with updating, despite the results on
the development set improving by 1%.

To further investigate the effects of updating, we sampled 60 words and
plotted the changes in their word embeddings under updating, using 2-d
vector fields generated with matplotlib and t-SNE (van der Maaten and
Hinton, 2008). Half of the words were chosen manually to include known
word clusters such as days of the week and names of countries; the other
half were selected randomly. Additional plots with 100 randomly-sampled
words and the top-100 most frequent words, for all the methods and all
the tasks, can be found in the supplementary material and at
https://123abc123abd.wordpress.com/. In each plot, a single arrow
signifies one word, pointing from the position of the original word
embedding to the updated representation.

Task               Benchmark                        In-domain Test set     Out-of-domain Test set
POS-tagging (Acc)  0.972 (Toutanova et al., 2003)   0.959 (Skip-gram+UP)   0.910 (Skip-gram)
Chunking (F1)      0.942 (Sha and Pereira, 2003)    0.938 (Brown, b=2000)  0.676 (Glove)
NER (F1)           0.893 (Ando and Zhang, 2005)     0.868 (Skip-gram)      0.736 (Skip-gram)
MWE (F1)           0.625 (Schneider et al., 2014a)  0.654 (CBOW+UP)        -

Table 3: State-of-the-art results vs. our best results for in-domain and
out-of-domain test sets.

[Figure 1 heat maps omitted; surviving panel labels: (c) NER (F1), (d) MWE (F1)]

Figure 1: Results for each type of word representation over POS-tagging,
Chunking, NER and MWE, optionally with updating (“+UP”). The y-axis
indicates the training data sizes (on a log scale). Green = high
performance, and red = low performance, based on a linear scale of the
best- to worst-result for each task.

In Fig. 3, we show vector field plots for Chunking and NER using
Skip-gram embeddings. For Chunking, most of the vectors were changed
with similar magnitude, but in very different directions, including
within the clusters of days of the week and country names. In contrast,
for NER, there was more homogeneous change in word vectors belonging to
the same cluster. This greater consistency is further evidence that
semantic homogeneity appears to be more beneficial for NER than
Chunking.


RQ4: What is the impact of word embeddings cross-domain and for OOV
words? As shown in Tab. 3, results predictably drop when we evaluate out
of domain. The difference is most pronounced for Chunking, where there
is an absolute drop in F1 of around 30% for all methods, indicating that
word embeddings and unigram features provide similar information for
Chunking.

Another interesting observation is that updating often hurts
out-of-domain performance because the distribution between domains is
different. This suggests that, if the objective is to optimise
performance across domains, it is best not to perform updating.

We also analyse performance on OOV words both in-domain and
out-of-domain in Fig. 2. As expected, word embeddings and Brown excel in
out-of-domain OOV performance. Consistent with our overall observations
about cross-domain generalisation, the OOV results are better when
updating is not performed.

Figure 2: Acc over out-of-vocabulary (OOV) words for in-domain and
out-of-domain test sets.

Figure 3: A t-SNE plot of the impact of updating on Skip-gram (surviving
panel label: (b) NER).

RQ5: Overall, are some word embeddings better than others? Comparing the
different word embedding techniques over our four sequence labelling
tasks, for the different evaluations (overall, out-of-domain and OOV),
there is no clear winner among the word embeddings - for POS-tagging,
Skip-gram appears to have a slight advantage, but this does not
generalise to other tasks.

While the aim of this paper was not to achieve the state of the art over
the respective tasks, it is important to concede that our best
(in-domain) results for NER, POS-tagging and Chunking are slightly worse
than the state of the art (Tab. 3). The 2.7% difference between our NER
system and the best performing system is due to the fact that we use a
first-order instead of a second-order CRF (Ando and Zhang, 2005), and
for the other tasks, there are similar differences in the learner and
the complexity of the features used. Another difference is that we tuned
the hyperparameters with random search, to enable replication using the
same random seed. In contrast, the hyperparameters for the
state-of-the-art methods are tuned more extensively by experts, making
them more difficult to reproduce.

5 Related Work 

Collobert et al. (2011) proposed a unified neural network framework that
learns word embeddings, and applied it to POS-tagging, Chunking, NER and
semantic role labelling. When they combined word embeddings with
hand-crafted features (e.g., word suffixes for POS-tagging; gazetteers
for NER) and applied other tricks like cascading and classifier
combination, they achieved state-of-the-art performance. Similarly,
Turian et al. (2010) evaluated three different word representations on
NER and Chunking, and concluded that unsupervised word representations
improved NER and Chunking. They also found that combining different word
representations can further improve performance. Guo et al. (2014) also
explored different ways of using word embeddings for NER. Owoputi et al.
(2013) and Schneider et al. (2014a) found that Brown clustering enhances
Twitter POS tagging and MWE, respectively. Compared to previous work, we
consider more word representations, including the most recent work, and
evaluate them on more sequence labelling tasks, wherein the models are
trained with training sets of varying size.

Bansal et al. (2014) reported that direct use of word embeddings in
dependency parsing did not show improvement. They achieved an
improvement only when they performed hierarchical clustering of the word
embeddings, and used features extracted from the cluster hierarchy. In a
similar vein, Andreas and Klein (2014) explored the use of word
embeddings for constituency parsing and concluded that the information
contained in word embeddings might duplicate that acquired by a
syntactic parser, unless the training set is extremely small. Other
syntactic parsing studies that reported improvements by using word
embeddings include Koo et al. (2008), Koo et al. (2010), Haffari et al.
(2011), Tratz and Hovy (2011) and Chen and Manning (2014).

Word embeddings have also been applied to other (non-sequential NLP)
tasks like grammar induction (Spitkovsky et al., 2011), and semantic
tasks such as semantic relatedness, synonymy detection, concept
categorisation, selectional preference learning and analogy (Baroni et
al., 2014).

Huang and Yates (2009) demonstrated that using distributional word
representation methods (like TF-IDF and LSA) as features improves the
labelling of OOV words, when tested on POS-tagging and Chunking. In our
study, we evaluate the labelling performance over OOV words for updated
vs. non-updated word embeddings, relative to the training set and with
out-of-domain data.

6 Conclusions 

We have performed an extensive extrinsic evaluation of four word
embedding methods under fixed experimental conditions, and evaluated
their applicability to four sequence labelling tasks: POS-tagging,
Chunking, NER and MWE identification. We found that word embedding
features reliably outperformed unigram features, especially with limited
training data, but that there was relatively little difference over
Brown clusters, and no one embedding method was consistently superior
across the different tasks and settings. Word embeddings and Brown
clusters were also found to improve performance out of domain and for
OOV words. We expected a performance gap between the fixed and
task-updated embeddings, but the observed difference was marginal.
Indeed, we found that updating can result in overfitting. We also
carried out preliminary analysis of the impact of updating on the
vectors, a direction which we intend to pursue further.